Introduction to Open Data Science - Course Project

About the project

Ruminations about the Introduction to Open Data Science (IODS) 2021

Feelings

The course Introduction to Open Data Science (IODS) feels like a good introduction to the R-Rstudio-git pipeline advocating for reproducible well-documented research.

Challenges may arise due to

  • various backgrounds of the students
  • technological hurdles

Outcomes

I hope the course will generate mindsets favoring open research equipped with technical skills to produce good research output.

I hope to learn not only technical skills but also ways to endorse open data science.

Where to find the course

I found the course in Sisu. Apparently it is also found in the MOOC platform.

Repo

A fork of the course can be found at my repo.


Week 2 - Simple linear regression

date()
## [1] "Thu Nov 25 21:20:48 2021"

Introduction

This week, we are analyzing and modeling a survey data concerning the approaches to learning. Description of the data can be found here and here.

The data has been preprocessed to include variables on gender, age, attitude, exam points, and scores on deep, surface and strategic questions.

Dataset

Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.

# read in the data
learning2014 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

dim(learning2014)
## [1] 166   7

There are 166 rows and 7 columns.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The variables are mostly numeric. Gender is of character value, change it to factor.

# change to factor
learning2014$gender <- as.factor(learning2014$gender)

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Seems like most of the students are female, age spans 17-55, exam points are between 7-33 and the question variables are capped at 5 (from documentation, on a scale 1-5).

Graphical overview

Using GGlally::ggpairs, we can plot the distribution of the variables, correlation (with significance), boxplots and scatterplots stratified by the gender.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • Most students are female
  • Age is right skewed (more of smaller age)
  • There is a positive correlation between points and attitude
  • There is, as expected, a negative correlation between surf and deep questions
  • There is a negative correlation between surf and attitude
  • There is a negative correlation between surf and stra questions
  • Distribution of the exam points is not perfectly normal (can be tested with e.g. stats::shapiro.test)

Linear regression

For pedagogical reasons, let’s fit a simple multivariable linear regression first with four (instead of three; choosing three would not change the results) explanatory variables, ie. regress exam points on other variables. Choose variables based on their absolute correlation with the exam points.

sort(abs(cor(learning2014[, -1])[, c("points")]))
##       deep        age       surf       stra   attitude     points 
## 0.01014849 0.09319032 0.14435642 0.14612247 0.43652453 1.00000000
model <- lm(points ~ attitude + stra + surf + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8639  -3.2849   0.3086   3.4548  10.7028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70644    3.96234   3.459 0.000694 ***
## attitude     3.38964    0.57041   5.942 1.68e-08 ***
## stra         0.93123    0.53986   1.725 0.086459 .  
## surf        -0.76565    0.80258  -0.954 0.341524    
## age         -0.09466    0.05346  -1.771 0.078502 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.262 on 161 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.2033 
## F-statistic: 11.52 on 4 and 161 DF,  p-value: 2.987e-08

From the summary, the intercept is very significantly non-zero based on a t-test (H_0: intercept is zero, H_1: intercept is not zero) although the result is not very meaningful as e.g. age cannot be zero in this context.

The attitude variable has a very significant effect on the exam points (p-value about zero for a t-test for H_0: attitude has no effect on the slope when the other variables are kept constant, H_1: attitude has an effect). Variables stra and age are weakly significant or non-significant depending on the nomenclature (same test). Variable surf is not significant, let’s remove it and refit.

model <- lm(points ~ attitude + stra + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

Now all explanatory variables are significant with a 0.10 significance cutoff.

The F-test is significant, some of the variables are therefore associated with the response variable (have non-zero coefficient).

Interpretation of the linear model

From the summary above, we see that as attitude increases by one, exam points increase by about 3.48 when other variables are kept constant. For stra, the value is about 1.00. Age seems to have a smaller effect, with an additional year decreasing exam points by around 0.09 if other variables do not change.

Multiple R-squared is 0.2182, meaning that the three explanatory variables account for approximately 22% variation in the exam points. This seems quite low and we should be careful when e.g. predicting exam points for new students with this model. The adjusted R-squared considers the number of explanatory variables and has here a similar, a bit smaller, value to the unadjusted R-squared.

Diagnostics

Linear models carry assumptions that need to be verified. We use diagnostic plots for visual assessment.

Residuals versus fitted

# residuals vs fitted values
plot(model, which = 1)

There doesn’t seem to be a clear pattern (maybe visually somewhat of a funnel with decreasing variance but Breusch-Pagan test car::ncvTest(model) doesn’t detect heteroskedasticity) in the residuals versus fitted values in the sense that the variance in residuals is similar across the fitted values. Also, the variation is approximately symmetrical around zero. Constant variance assumption appears to be met. Linear model appears to be OK for the data.

Three outliers (set by id.n in plot.lm) are detected (rows 35, 56 and 145), let’s check them.

print(learning2014[c(35, 56, 145), ])
##     gender age attitude     deep  stra     surf points
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 145      F  21      3.5 3.916667 3.875 3.500000      7
annotate <- c(35, 56, 145, 4, 2)  # 4 and 2 for leverage below

# make the plot tighter
op <- par(mfrow=c(3, 1),
          mar = c(4, 4, 1, 1),
          mgp = c(2, 1, 0))

parameters <- c("attitude", "stra", "age")

for (param in parameters) {
  print(param)
  plot(learning2014[, param], learning2014$points, 
       xlab = param, ylab = "points")
  
  text(learning2014[annotate, param], 
       learning2014[annotate, "points"], 
       annotate, pos = 4, col = "red")
}
## [1] "attitude"
## [1] "stra"
## [1] "age"

par(op) # return original par

It appears that here are the students that got low exam points despite having a high attitude value.

Plot boxplot for assessment of symmetry of the residual distribution.

boxplot(resid(model))

Does not seem too bad, but is not perfectly symmetric. QQ-plot will aid in deciding normality.

QQ-plot

# qq-plot
plot(model, which = 2)

Standardized residuals follow the linear pattern quite reasonably with the same outlier exceptions as above. Hence, there is no strong reason to suspect deviation from normality in the distribution of the residuals.

Residuals versus leverage

# residuals vs. leverage
plot(model, which = 5)

Observations 2, 4 and 56 have been marked as outliers, they have relatively high influence of the regression line compared to other observations. We can rerun the model without these to see if the multiple R-squared is increased.

Cook’s distance is also low, supporting the notion that there are no outliers having a great effect on the linear fit.

max(cooks.distance(model))
## [1] 0.1202162
summary(lm(points ~ attitude + stra + age, data = learning2014[-c(2, 4, 56), ]))
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014[-c(2, 
##     4, 56), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.822  -3.520   0.348   3.617  10.919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.38472    2.58566   3.243  0.00144 ** 
## attitude     3.64594    0.53695   6.790 2.11e-10 ***
## stra         0.84219    0.51066   1.649  0.10108    
## age          0.01968    0.05677   0.347  0.72930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.008 on 159 degrees of freedom
## Multiple R-squared:  0.2419, Adjusted R-squared:  0.2276 
## F-statistic: 16.91 on 3 and 159 DF,  p-value: 1.394e-09

Multiple R-square is indeed a bit better. However, removal of the outliers would have to be justified from the data (e.g. are these students somehow different).

VIF

library(car)
## Loading required package: carData
vif(model)
## attitude     stra      age 
## 1.004083 1.014233 1.010865

Variance inflation factors are less than 10 (textbook), so there is no concern for collinearity.

AIC

We can try automatic stepwise model selection by AIC criterion.

library(MASS)
model.full <- lm(points ~ gender + age + attitude + deep + stra + surf, 
                 data = learning2014)
stepAIC(model.full, direction = "both")
## Start:  AIC=558.36
## points ~ gender + age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - gender    1      0.11 4408.5 556.36
## - surf      1     47.79 4456.1 558.15
## - deep      1     49.00 4457.3 558.19
## <none>                  4408.3 558.36
## - stra      1     83.78 4492.1 559.48
## - age       1     87.88 4496.2 559.63
## - attitude  1    919.82 5328.2 587.82
## 
## Step:  AIC=556.36
## points ~ age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     47.69 4456.1 556.15
## - deep      1     49.11 4457.6 556.20
## <none>                  4408.5 556.36
## - stra      1     88.35 4496.8 557.66
## - age       1     90.16 4498.6 557.72
## + gender    1      0.11 4408.3 558.36
## - attitude  1    999.18 5407.6 588.27
## 
## Step:  AIC=556.15
## points ~ age + attitude + deep + stra
## 
##            Df Sum of Sq    RSS    AIC
## - deep      1     26.61 4482.8 555.14
## <none>                  4456.1 556.15
## + surf      1     47.69 4408.5 556.36
## - age       1     75.36 4531.5 556.93
## - stra      1    106.07 4562.2 558.05
## + gender    1      0.02 4456.1 558.15
## - attitude  1   1084.38 5540.5 590.30
## 
## Step:  AIC=555.14
## points ~ age + attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4482.8 555.14
## - age       1     76.62 4559.4 555.95
## + deep      1     26.61 4456.1 556.15
## + surf      1     25.20 4457.6 556.20
## - stra      1     97.64 4580.4 556.71
## + gender    1      0.01 4482.8 557.14
## - attitude  1   1060.72 5543.5 588.39
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)          age     attitude         stra  
##    10.89543     -0.08822      3.48077      1.00371

The produced model is the same we derived earlier.

Brute force

Finally, we can try to brute force all linear models of simple combination of explanatory variables.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
res <- olsrr::ols_step_all_possible(model.full)

res[order(res$adjr, decreasing = TRUE), ]
##    Index N                         Predictors     R-Square Adj. R-Square
## 62    57 5        age attitude deep stra surf 0.2311318189   0.207104688
## 33    22 3                  age attitude stra 0.2181719645   0.203693668
## 52    42 4             age attitude deep stra 0.2228135019   0.203504521
## 54    43 4             age attitude stra surf 0.2225665343   0.203251417
## 63    63 6 gender age attitude deep stra surf 0.2311510113   0.202137842
## 43    44 4           gender age attitude stra 0.2181729919   0.198748718
## 57    58 5      gender age attitude deep stra 0.2228168858   0.198529914
## 59    59 5      gender age attitude stra surf 0.2226046106   0.198311005
## 53    45 4             age attitude deep surf 0.2157221484   0.196236984
## 56    46 4            attitude deep stra surf 0.2154077956   0.195914822
## 17     7 2                      attitude stra 0.2048096617   0.195052725
## 38    23 3                 attitude deep stra 0.2096692889   0.195033535
## 34    24 3                  age attitude surf 0.2081990315   0.193536051
## 40    25 3                 attitude stra surf 0.2074263400   0.192749050
## 58    60 5      gender age attitude deep surf 0.2165385715   0.192055402
## 12     8 2                       age attitude 0.2011434849   0.191341564
## 61    61 5     gender attitude deep stra surf 0.2158241415   0.191318646
## 27    26 3               gender attitude stra 0.2050965812   0.190376147
## 48    47 4          gender attitude deep stra 0.2098633360   0.190232611
## 32    27 3                  age attitude deep 0.2043132366   0.189578297
## 44    48 4           gender age attitude surf 0.2090640320   0.189413449
## 50    49 4          gender attitude stra surf 0.2079025033   0.188223062
## 39    28 3                 attitude deep surf 0.2023788945   0.187608133
## 22    29 3                gender age attitude 0.2017804149   0.186998571
## 3      1 1                           attitude 0.1905536672   0.185618019
## 18     9 2                      attitude surf 0.1952865878   0.185412804
## 42    50 4           gender age attitude deep 0.2048827672   0.185128302
## 49    51 4          gender attitude deep surf 0.2040700385   0.184295381
## 16    10 2                      attitude deep 0.1939911024   0.184101423
## 28    30 3               gender attitude surf 0.1970278452   0.182157990
## 8     11 2                    gender attitude 0.1919357660   0.182020867
## 26    31 3               gender attitude deep 0.1952588802   0.180356267
## 47    52 4               gender age stra surf 0.0653293394   0.042107708
## 60    62 5          gender age deep stra surf 0.0707276010   0.041687839
## 37    32 3                      age stra surf 0.0520482669   0.034493605
## 55    53 4                 age deep stra surf 0.0568671481   0.033435276
## 24    33 3                    gender age stra 0.0504456777   0.032861338
## 31    34 3                   gender stra surf 0.0462022812   0.028539360
## 45    54 4               gender age deep stra 0.0514840047   0.027918390
## 51    55 4              gender deep stra surf 0.0510011757   0.027423565
## 21    12 2                          stra surf 0.0363412029   0.024517169
## 25    35 3                    gender age surf 0.0420448279   0.024304917
## 41    36 3                     deep stra surf 0.0407134651   0.022948900
## 10    13 2                        gender stra 0.0346692266   0.022824677
## 46    56 4               gender age deep surf 0.0462679619   0.022572756
## 15    14 2                           age surf 0.0340077514   0.022155086
## 14    15 2                           age stra 0.0331743748   0.021311484
## 36    37 3                      age deep surf 0.0379369858   0.020121004
## 29    38 3                   gender deep stra 0.0357520229   0.017895579
## 35    39 3                      age deep stra 0.0336895603   0.015794923
## 5      2 1                               stra 0.0213517768   0.015384410
## 6      3 1                               surf 0.0208387755   0.014868280
## 11    16 2                        gender surf 0.0267878521   0.014846599
## 30    40 3                   gender deep surf 0.0306219089   0.012670463
## 20    17 2                          deep surf 0.0244545065   0.012484623
## 19    18 2                          deep stra 0.0219453518   0.009944681
## 7     19 2                         gender age 0.0196556536   0.007626889
## 2      4 1                                age 0.0086844365   0.002639829
## 1      5 1                             gender 0.0086318623   0.002586935
## 23    41 3                    gender age deep 0.0198419947   0.001690921
## 9     20 2                        gender deep 0.0088743608  -0.003286690
## 13    21 2                           age deep 0.0087454938  -0.003417138
## 4      6 1                               deep 0.0001029919  -0.005993941
##    Mallow's Cp
## 62    5.003969
## 33    3.684101
## 52    4.724219
## 54    4.775293
## 63    7.000000
## 43    5.683889
## 57    6.723519
## 59    6.767418
## 53    6.190730
## 56    6.255739
## 17    4.447461
## 38    5.442477
## 34    5.746530
## 40    5.906325
## 58    8.021891
## 12    5.205636
## 61    8.169637
## 27    6.388125
## 48    7.402347
## 32    6.550123
## 44    7.567646
## 50    7.807853
## 39    6.950150
## 22    7.073917
## 3     5.395638
## 18    6.416857
## 42    8.432342
## 49    8.600417
## 16    6.684767
## 28    8.056761
## 8     7.109816
## 26    8.422587
## 47   37.292359
## 60   38.175985
## 37   38.038920
## 55   39.042363
## 24   38.370340
## 31   39.247886
## 45   40.155611
## 51   40.255461
## 21   39.287183
## 25   40.107658
## 41   40.382987
## 10   39.632952
## 46   41.234303
## 15   39.769746
## 14   39.942091
## 36   40.957170
## 29   41.409026
## 35   41.835549
## 5    40.387035
## 6    40.493125
## 11   41.262841
## 30   42.469948
## 20   41.745383
## 19   42.264283
## 7    42.737798
## 2    43.006675
## 1    43.017547
## 23   44.699262
## 9    44.967398
## 13   44.994048
## 4    44.781340

By adjusted R-Square, it appears that the three variable model we used before fares very well compared to the full model. Attitude alone is also quite good in comparison and could be chosen by parsimony.


Week 3 - Logistic regression

date()
## [1] "Thu Nov 25 21:20:59 2021"

Introduction

This week, we are analyzing and modeling data on two questionnaires related to student performance and alcohol consumption. Further description is available here. Since the description is relevant for analysis we reprint it verbatim:

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The dataset has been preprocessed by joining data from two courses, Math and Portugese language. Variables alc_use and high_use have been added in the wrangling phase.

Dataset

Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.

rm(list=ls())

# read in the data, convert strings to factors
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
                header = TRUE, sep = ",", stringsAsFactors = TRUE)

dim(alc)
## [1] 370  51

There are 370 rows and 51 columns.

str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

Seems like there are mostly integer and character (factorized) valued variables, as well as at least one logical (high_use) and float (alc_use).

# (Type for factor is int.)
table(sapply(alc, typeof))
## 
##  double integer logical 
##       1      49       1
# Check for zero variance (negate characters to get "not all duplicated")
sort(
  sapply(alc, function(x) { 
    ifelse(is.numeric(x), var(x), !all(duplicated(x)[-1L])) 
    }))
##            n   failures.p     failures   traveltime   failures.m    studytime 
## 0.000000e+00 2.398667e-01 3.109939e-01 4.916502e-01 5.049513e-01 7.189922e-01 
##         Dalc       famrel     freetime      alc_use       school          sex 
## 8.032594e-01 8.304695e-01 9.712224e-01 9.958178e-01 1.000000e+00 1.000000e+00 
##      address      famsize      Pstatus         Mjob         Fjob       reason 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     guardian    schoolsup       famsup   activities      nursery       higher 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     internet     romantic         paid       paid.p       paid.m     high_use 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##         Medu         Fedu        goout          age         Walc       health 
## 1.173984e+00 1.179697e+00 1.273720e+00 1.393987e+00 1.666366e+00 1.981220e+00 
##         G2.p         G1.p           G1           G2         G3.p           G3 
## 6.078518e+00 6.512854e+00 7.301699e+00 7.914627e+00 8.665092e+00 1.080868e+01 
##         G1.m         G2.m         G3.m   absences.p     absences   absences.m 
## 1.119153e+01 1.444749e+01 2.124131e+01 2.330626e+01 3.029392e+01 5.876224e+01 
##          cid         id.m         id.p 
## 1.143917e+04 1.274368e+04 3.041907e+04
# Seems OK, only `n` has zero variance.

Variables correlated to alcohol consumption

Let’s apply a data-driven approach to discover the top 4 variables linked to the alcohol consumption. Intuitively, family relationships and characteristics, study time, grades, and peer behavior should have an effect.

We want to compute association between all the variables. To allow for factors, we employ a stackoverflow solution

# From 
# https://stackoverflow.com/questions/52554336/plot-the-equivalent-of-correlation-matrix-for-factors-categorical-data-and-mi/56485520#56485520

library(tidyverse)
library(rcompanion)


# Calculate a pairwise association between all variables in a data-frame. In particular nominal vs nominal with Chi-square, numeric vs numeric with Pearson correlation, and nominal vs numeric with ANOVA.
# Adopted from https://stackoverflow.com/a/52557631/590437
mixed_assoc = function(df, cor_method="spearman", adjust_cramersv_bias=TRUE){
    df_comb = expand.grid(names(df), names(df),  stringsAsFactors = F) %>% set_names("X1", "X2")

    is_nominal = function(x) class(x) %in% c("factor", "character")
    # https://community.rstudio.com/t/why-is-purr-is-numeric-deprecated/3559
    # https://github.com/r-lib/rlang/issues/781
    is_numeric <- function(x) { is.integer(x) || is_double(x)}

    f = function(xName,yName) {
        x =  pull(df, xName)
        y =  pull(df, yName)

        result = if(is_nominal(x) && is_nominal(y)){
            # use bias corrected cramersV as described in https://rdrr.io/cran/rcompanion/man/cramerV.html
            cv = cramerV(as.character(x), as.character(y), bias.correct = adjust_cramersv_bias)
            data.frame(xName, yName, assoc=cv, type="cramersV")

        }else if(is_numeric(x) && is_numeric(y)){
            correlation = cor(x, y, method=cor_method, use="complete.obs")
            data.frame(xName, yName, assoc=correlation, type="correlation")

        }else if(is_numeric(x) && is_nominal(y)){
            # from https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618
            r_squared = summary(lm(x ~ y))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else if(is_nominal(x) && is_numeric(y)){
            r_squared = summary(lm(y ~x))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else {
            warning(paste("unmatched column type combination: ", class(x), class(y)))
        }

        # finally add complete obs number and ratio to table
        result %>% mutate(complete_obs_pairs=sum(!is.na(x) & !is.na(y)), complete_obs_ratio=complete_obs_pairs/length(x)) %>% rename(x=xName, y=yName)
    }

    # apply function to each variable combination
    map2_df(df_comb$X1, df_comb$X2, f)
}

Then we can use the function for correlations

# remove boolean high_use (or recode it to integer/factor)
mixed.cors <- mixed_assoc(alc[, -which(names(alc) %in% c("high_use"))])
library(corrplot)

cor.matrix <- 
  mixed.cors %>% 
  select(x, y, assoc) %>% 
  spread(y, assoc) %>% 
  select(-x) %>%
  replace(is.na(.), 0)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 0.5, tl.col = "black")

From the correlation plot, we see that, as expected by definition, Dalc and Walc correlate positively with alc_use. It would be interesting to include these for further analysis to explain the drinking behavior. Since we’re limited to four variables, we choose a coarse approach and only take alc_use.

The grade variables correlate negatively with alc_use; based on the data annotation in the beginning, it is easiest to choose G3 for further analysis. The famrel correlates negatively with alc_use, and seems quite interesting. goout, sex and studytime are interesting as well. Let’s make sure these variables don’t correlate too much.

interesting_vars <- c("alc_use", "G3", "famrel", "goout", "sex", "studytime")

cor.matrix <- 
  alc %>%
    select(interesting_vars) %>%
    mixed_assoc(cor_method = "pearson") %>%
    select(x, y, assoc) %>% 
    spread(y, assoc) %>%
    select(-x)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

Seems good!

# Select the subset
alc.sub <- alc[, interesting_vars]

str(alc.sub)
## 'data.frame':    370 obs. of  6 variables:
##  $ alc_use  : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ G3       : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ famrel   : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ goout    : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ studytime: int  4 2 1 3 3 3 3 2 4 4 ...

Further exploration

Begin with the GGally::ggpairs

library(GGally)
library(ggplot2)

p <- ggpairs(alc.sub, mapping = aes(col = sex, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • There is approximately equal number of males (n=175) and females (n=195)
  • Males tend to use more alcohol
  • Distribution of grade G3 is similar for both sexes, and is not normal (by shapiro.test()), has a peak and is left skewed
  • famrel correlates with alc_use, but significantly only for males
  • Males report better quality family relationships
  • Males use less time for studying
  • goout correlates positively with alcohol use with a high significance
  • studytime correlates negatively with alcohol use with a high significance

All in all, there are several hypotheses

  • Getting a better grade while being male is associated with lower alcohol use
  • Family relationships affect alcohol use
  • Alcohol is used when going out
  • Alcohol is not used when studying

Let’s take a closer look on the quality of family relationship versus alcohol use

library(tidyverse)

alc_var <- function(alc, expl.variable) {
  expl.vars <- c(expl.variable, "alc_use", "sex")
  
  # Tidyverse black magic
  res <- 
    alc %>%
      select(one_of(expl.vars)) %>%
      count(!!!sapply(expl.vars, as.symbol))
  
  g <- ggplot(res, aes(x = .data[[expl.variable]], y = alc_use, col = sex)) + 
    geom_point(aes(size = n), alpha = 0.5) + geom_smooth(method = "loess")
  
  return(g)
}

alc_var(alc, "famrel")

With a bit of caution, it appears than men with low quality famrel use more alcohol, as there is an increase in alc_use when famrel goes from medium to bad. However, the relationship is roughly linear (without deep analysis):

summary(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## 
## Call:
## lm(formula = alc_use ~ famrel, data = alc[alc$sex == "M", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7458 -0.9118 -0.1898  0.8102  3.0882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.30184    0.38377   8.604 4.55e-15 ***
## famrel      -0.27800    0.09368  -2.968  0.00343 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.128 on 173 degrees of freedom
## Multiple R-squared:  0.04844,    Adjusted R-squared:  0.04294 
## F-statistic: 8.807 on 1 and 173 DF,  p-value: 0.003427
# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))

par(op) # return original par

library(car)
# There is no heteroscedasticity by Breusch-Pagan test
car::ncvTest(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.398195, Df = 1, p = 0.52802

Next, plot the other variables as well

library(ggpubr)

g1 <- alc_var(alc, "goout")
g2 <- alc_var(alc, "sex")
g3 <- alc_var(alc, "studytime")
g4 <- alc_var(alc, "G3")

ggarrange(g1, g2, g3, g4,
          ncol = 4, nrow = 1, common.legend = TRUE, 
          legend = "bottom")

When goout increases, so does the alcohol use. Males appear to use more alcohol. As studytime increases, alcohol usage drops. G3 does not have a clear linear pattern with alcohol use. Otherwise, the above hypotheses are OK.

Logistic regression

Now we are ready to fit the logistic regression model with G3, famrel, goout, sex and studytime.

model.1 <- glm(paste0("high_use", " ~ ", paste(interesting_vars[-1], collapse=" + ")),
               family = "binomial", data = alc)

summary(model.1)
## 
## Call:
## glm(formula = paste0("high_use", " ~ ", paste(interesting_vars[-1], 
##     collapse = " + ")), family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6895  -0.7784  -0.4891   0.8155   2.6807  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.85026    0.85893  -0.990  0.32222    
## G3          -0.03667    0.04031  -0.910  0.36294    
## famrel      -0.41604    0.13989  -2.974  0.00294 ** 
## goout        0.77038    0.12402   6.212 5.25e-10 ***
## sexM         0.80171    0.26695   3.003  0.00267 ** 
## studytime   -0.46034    0.17229  -2.672  0.00754 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.86  on 364  degrees of freedom
## AIC: 381.86
## 
## Number of Fisher Scoring iterations: 4

We got a model with four significant variables. G3 is not significant, remove it and refit.

model.2 <- glm(high_use ~ famrel + goout + sex + studytime,
               family = "binomial", data = alc)

summary(model.2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733  0.08309 .  
## famrel       -0.4193     0.1399  -2.996  0.00273 ** 
## goout         0.7873     0.1230   6.399 1.57e-10 ***
## sexM          0.7959     0.2669   2.982  0.00286 ** 
## studytime    -0.4814     0.1711  -2.814  0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4

There is a lot of output

  • Fitted model is \(log \frac{Pr(high\_use =1)}{ 1 - Pr(high\_use = 1)} = log \frac{Pr(high\_use =1)}{Pr(high\_use = 0)} = -1.2672 - 0.4193*famrel + 0.7873 * goout + 0.7959 * sexM - 0.4814 * studytime\)
  • The coefficients represent a change in the log odds of the response when the explanatory variable increases by one, conditional on the other explanatory variables remaining constant
  • All of the variables (and the non-zero intercept with 0.10 p-value cutoff) are significant by Wald test
  • Deviance is similar in concept to the squared error in ordinary regression and quantifies difference between the observed and predicted proportions/probabilities
  • Residual deviance is the sum of squares for residuals
  • Null deviance is the deviance for null model (high_use ~ 1)
  • AIC is the Akaike information criterion and can be used for model selection
  • Number of Fisher Scoring iterations is related to how many iterations are need to fit the model
# Residual deviance
sum(residuals(model.2, type = "deviance")^2)
## [1] 370.6854

See if residuals are roughly linear, otherwise we can’t use GLM.

# Plot residuals and check that linearity is OK
plot(model.2, which = 1)  # -> Looks OK.

We can also test with ANOVA if null model is significantly different from our model

# see http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf
# and https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
# and https://stats.stackexchange.com/questions/59879/

# 1-pchisq(model.2$null.deviance-model.2$deviance, model.2$df.null-model.2$df.residual)
null <- glm(high_use ~ 1, family = "binomial", data = alc)
anova(null, model.2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ 1
## Model 2: high_use ~ famrel + goout + sex + studytime
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       369     452.04                          
## 2       365     370.69  4   81.354 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude that at least one of the explanatory variables has non-zero coefficient by this test.

Compute odds ratios and confidence intervals

OR <- coef(model.2) %>% exp
CI <- confint(model.2) %>% exp

list(OR = OR, CI = CI)
## $OR
## (Intercept)      famrel       goout        sexM   studytime 
##   0.2816141   0.6575181   2.1974324   2.2165443   0.6179355 
## 
## $CI
##                  2.5 %    97.5 %
## (Intercept) 0.06545486 1.1622100
## famrel      0.49791884 0.8636137
## goout       1.73873662 2.8198312
## sexM        1.31841735 3.7630180
## studytime   0.43751933 0.8576353

The odds ratio for famrel is 0.66 and 95% CI is [0.50, 0.86]. Thus, increase by 1 unit in famrel is associated with a decrease in the odds of high_use by 14-50%.

Variables goout (increases high_use odds) and studytime (decreases high_use odds) are interpreted analogously and neither containts 1 in the 95% CI.

The sexM is interpreted in relation to implicit sexF: being male increases the high_use odds by around 122% with 95% CI of [1.32, 3.76].

It seems that the results correspond to the hypotheses. It feels like famrel would require more work although the interaction term of sex*famrel is not significant when added to model.2 (data not shown).

Predictive power

observed <- alc$high_use
predicted <- predict(model.2, type = "response") > 0.5

# 2x2 tabulation, confusion matrix
table(observed, predicted)
##         predicted
## observed FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52

Most cases are classified correctly.

Below are the predictions versus actual high_use values, plotted separately for each explanatory variable.

# Get predictions, compare against true values
predicted_abs <- predict(model.2, type = "response")

df <- data.frame(cbind(observed, predicted, predicted_abs, 
                       alc[, interesting_vars[3:6]]))

df$matches <- df$observed == df$predicted

# Plot by explanatory variables
prob.expl <- function(df, ind.var, expl.var, col.var) {

  g <- ggplot(df, aes(x = .data[[ind.var]], 
                      y = .data[[expl.var]])) + 
    geom_point(alpha = 0.5, aes(col = .data[[col.var]])) + 
    geom_smooth(method = "loess", col = "black", size = 0.5) +
    theme_bw()
  g
}
  
g1 <- prob.expl(df, "predicted_abs", "famrel", "matches")
g2 <- prob.expl(df, "predicted_abs", "goout", "matches")
g3 <- prob.expl(df, "predicted_abs", "sex", "matches")
g4 <- prob.expl(df, "predicted_abs", "studytime", "matches")

ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"),
          common.legend = TRUE, legend = "bottom")

From the plot above we see where the predictions of high_useare right (greenish points) and wrong (red points). For famrel (A), sex (C) and studytime (D), the predictions seem to be false mostly around probability 0.5. With goout, high_use appears to be poorly predicted when goout is low and predicted value is in the range of [0.5, 0.7].

Quantify at which probability values mismatches tend to occur:

# Predictions equal observation
round(table(cut_interval(df[df$matches == TRUE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
##      0.36      0.22      0.08      0.08      0.02
# False predictions
round(table(cut_interval(df[df$matches == FALSE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
##      0.06      0.05      0.10      0.03

This confirms that most false predictions are made around the probability 0.5.

The training error of the model is 0.24, i.e. lowish. If we just guessed using, say, goout, the error would be around 0.28, which is not very bad. Of course the guessing is not very random, since we already know that goout is an important predictor.

# Training error, model.2
sum(df$matches == FALSE) / length(df$matches)
## [1] 0.2378378
# Guess high_use by goout
guesses <- rep(FALSE, times = nrow(df))
guesses[df$goout > mean(df$goout)] <- TRUE

table(df$observed, guesses)
##        guesses
##         FALSE TRUE
##   FALSE   198   61
##   TRUE     41   70
sum(guesses != df$observed) / nrow(df)
## [1] 0.2756757

Cross-validation

Follow the code in DataCamp for this one

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, df$predicted_abs)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model.2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405

The test set performance is a bit better than the DataCamp baseline, 0.25 versus 0.26.

Simple model selection

# Select all explanatory variables available
training.vars <- colnames(alc)[1:36] # Discard non-joined .m and .p variables

# Discard alcohol consumption variables and technical variables
training.vars <- training.vars[!(
  training.vars %in% c("Dalc", "Walc", "n", "id.p", "id.m"))]

Now, with enough compute we could just try all the variables, but that would amount to 2^32 = 4294967296 models. We instead use stepwise selection by AIC starting with the full model and progressing backwards. We also assume R handles the factor variable to dummy (treatment) encoding. (Note the problems of stepwise regression. It would be advisable to use other methods e.g. through Caret.)

library(MASS)

# Run stepwise AIC selection on full model, keep the model and its AIC
final.m <- stepAIC(
  # Full model
  glm(paste0("high_use", " ~ ", paste(training.vars[-1], collapse=" + ")),
             family = binomial, data = alc),
  direction = "backward", trace = FALSE, 
  # What to keep from models
  keep = function(model, aic) { list(model = model, aic = aic) } )

Then perform 10-fold CV for the models

# Init
CVs  <- rep(NULL, ncol(final.m$keep))  # Test error
ERRs <- rep(NULL, ncol(final.m$keep))  # Train error
AICs <- rep(NULL, ncol(final.m$keep))
Nvar <- rep(NULL, ncol(final.m$keep))

for (i in 1:ncol(final.m$keep)) {  # Each column is a model
  
  interim.m <- final.m$keep[, i][1]$model
  CVs[i] <- cv.glm(data = alc, cost = loss_func, glmfit = interim.m, K = 10)$delta[1]
  
  ERRs[i] <- loss_func(alc$high_use, predict(interim.m, type = "response"))
  
  AICs[i] <- final.m$keep[, i][2]$aic
  
  Nvar[i] <- length(final.m$keep[, i][1]$model$coefficients) - 1
}

CV.res <- data.frame(test = CVs, train = ERRs, AIC = AICs, Nvar = Nvar)
# Somewhat unsatisfactory graph with points

g1 <- ggplot(CV.res, aes(x = Nvar, y = test)) + geom_line() + 
  geom_point(aes(x = Nvar, y = test, color = AIC), size = 5, alpha = 0.8) +
  theme_bw() +  xlab("Number of variables") + ylab("Test set error") +
  ggtitle("10-fold CV on backward stepwise variable selection with AIC") +
  scale_color_gradient(low="blue", high="red")

g1

# Better graph with both test and training error, and AIC separately

errors <- gather(CV.res, type, error, c("test", "train"))

g1 <- ggplot(errors, aes(x = Nvar, color = type)) + geom_line(aes(y = error)) + 
  theme_bw() +  xlab("Number of variables") + ylab("Test set error")

g2 <- ggplot(errors, aes(x = Nvar)) + geom_line(aes(y = AIC)) + 
  theme_bw() +  xlab("Number of variables") + ylab("AIC")
  
g <- ggarrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.5),
          common.legend = FALSE, legend = "left")

annotate_figure(g, top = text_grob("10-fold CV on backward stepwise variable selection with AIC"))

The final model was

summary(final.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian + 
##     activities + romantic + famrel + goout + health + absences, 
##     family = binomial, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8167  -0.7154  -0.4262   0.5635   2.7612  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.14937    0.86318  -2.490 0.012772 *  
## sexM              0.98275    0.28722   3.422 0.000623 ***
## addressU         -0.89491    0.33661  -2.659 0.007847 ** 
## famsizeLE3        0.47009    0.29971   1.569 0.116759    
## reasonhome        0.31183    0.35612   0.876 0.381229    
## reasonother       1.24517    0.47895   2.600 0.009328 ** 
## reasonreputation -0.28654    0.36969  -0.775 0.438278    
## guardianmother   -0.68303    0.32428  -2.106 0.035179 *  
## guardianother     0.41726    0.71637   0.582 0.560256    
## activitiesyes    -0.51481    0.28383  -1.814 0.069710 .  
## romanticyes      -0.50684    0.30464  -1.664 0.096171 .  
## famrel           -0.50627    0.15580  -3.249 0.001156 ** 
## goout             0.91148    0.13829   6.591 4.36e-11 ***
## health            0.16738    0.10359   1.616 0.106138    
## absences          0.09152    0.02373   3.856 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 335.02  on 355  degrees of freedom
## AIC: 365.02
## 
## Number of Fisher Scoring iterations: 5

Removing all non-significant explanatory variables and iterating yields a model with 0.20 prediction error and 0.22 test set error in 10-fold CV:

parsimonious.m <- glm(formula = high_use ~ sex + address + reason + guardian + 
    activities + famrel + goout + absences, family = binomial, data = alc)

summary(parsimonious.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + reason + guardian + 
##     activities + famrel + goout + absences, family = binomial, 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8386  -0.7558  -0.4676   0.6257   2.8133  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.65723    0.77828  -2.129 0.033225 *  
## sexM              1.09542    0.27999   3.912 9.14e-05 ***
## addressU         -0.84765    0.33097  -2.561 0.010435 *  
## reasonhome        0.24841    0.34895   0.712 0.476533    
## reasonother       1.06110    0.46814   2.267 0.023412 *  
## reasonreputation -0.35044    0.36541  -0.959 0.337531    
## guardianmother   -0.65433    0.31871  -2.053 0.040071 *  
## guardianother     0.27975    0.69303   0.404 0.686459    
## activitiesyes    -0.55234    0.28067  -1.968 0.049075 *  
## famrel           -0.45739    0.15011  -3.047 0.002311 ** 
## goout             0.87831    0.13550   6.482 9.06e-11 ***
## absences          0.08739    0.02335   3.743 0.000182 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 341.90  on 358  degrees of freedom
## AIC: 365.9
## 
## Number of Fisher Scoring iterations: 5
loss_func(alc$high_use, predict(parsimonious.m, type = "response"))
## [1] 0.1972973
cv.glm(data = alc, cost = loss_func, glmfit = parsimonious.m, K = 10)$delta[1]
## [1] 0.2243243

Week 4 - Clustering and classification

date()
## [1] "Thu Nov 25 21:21:33 2021"

Introduction

This week, we are analyzing and modeling a dataset from MASS package, “Housing Values in Suburbs of Boston”. References and further description can be found here.

Dataset

rm(list=ls())

library(MASS)
boston <- MASS::Boston

dim(boston)
## [1] 506  14

The data hase 506 rows and 14 columns.

str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Most of the columns are double valued, chas and rad are integer typed.

table(sapply(boston, typeof))
## 
##  double integer 
##      12       2

Summary on the variables is shown below. crim is per capita crime rate by town, tax is a tax rate. chas is a dummy variable describing relative location to Charles River (1 if tract bounds river). Full descriptions are here.

summary(boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Correlation plot shows many strong correlations (without going into p-values):

library(corrplot)
library(tidyverse)

cor.matrix <- cor(boston)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "ellipse", 
         order = "AOE", number.cex = 0.5,  # Use some nice ordering
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

We can filter and show correlations whose absolute value is above 0.7. Only show significant correlation coefficients. We see that e.g. dis and age correlate negatively, while nox and age correlate positively.

cor.matrix.filtered <- cor(boston)

cor.matrix.filtered <- apply(cor.matrix.filtered, 2, function(x) {
  ifelse(abs(x) > 0.7, x, 0)
})

rownames(cor.matrix.filtered) <- colnames(cor.matrix.filtered)

# Don't show zeroes. Get the order manually to match that above.
ord <- corrMatOrder(cor.matrix, order="AOE")

cor.matrix.filtered <- cor.matrix.filtered[ord, ord]
col <- ifelse(abs(cor.matrix.filtered - 0) < .Machine$double.eps, "white", "black") #[ord, ord]

# Test for correlation significance.
testRes = cor.mtest(boston, conf.level = 0.95)

corrplot(as.matrix(cor.matrix.filtered),
         method = "ellipse", order = "original", number.cex = 0.5,
         tl.cex = 1, tl.col = "black",
         p.mat = testRes$p, insig = "blank", na.label = " ", 
         use = "complete.obs", addCoef.col = col)

Plot the variables that correlate highly with some other variable

interesting <- which(
  abs(cor.matrix.filtered) > 0.7 & abs(cor.matrix.filtered) < 1, 
  arr.ind = TRUE)

boston.cor <- boston[, which(
  colnames(boston) %in% rownames(interesting))]

# pairs(boston.cor, col = boston$chas)

library(GGally)

lower <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point() + 
      geom_smooth(method=method, ...)
      p
}

p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3), 
             lower = list(continuous = lower),  
             upper = list(continuous = wrap("cor", size = 2)))

p

We see that e.g. nox and age have a clear non-linear pattern (modeling with lm reveals heteroscedasticity of the residuals vs. fitted values). Other variables produce two groups of points with rad or tax.

Then plot the other ones that do not correlate strongly.

boston.cor <- boston[, which(!  # not
  colnames(boston) %in% c(rownames(interesting), "chas"))]

# c("nox", "dis", "tax", "indus", "age", "rad", "medv", "lstat")

# pairs(boston.cor, col = boston$chas)

lower <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point() + 
      geom_smooth(method=method, ...)
      p
}

p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3), 
             lower = list(continuous = lower),  
             upper = list(continuous = wrap("cor", size = 2)))

p

Here we also see some interesting patterns, which are left for future analyses.

Then plot variable distributions, stratify by binary chas

library(reshape2)

boston$chas <- as.factor(boston$chas)

boston.m <- 
  boston %>%
    melt(id.vars = c("chas"))

ggplot(boston.m, aes(value, col = chas)) +
  geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")

Most of the distributions are not close to normal, and most observations belong to chas == 0 class. Distributions of indus, tax and rad have two peaks.

Standardization

boston.std <- as.data.frame(
  scale(boston[, which(!(colnames(boston) %in% c("chas")))]))

boston.std$chas <- boston$chas

summary(boston.std)
##       crim                 zn               indus              nox         
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-1.4644  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.9121  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.1441  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.: 0.5981  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv         chas   
##  Min.   :-1.9063   0:471  
##  1st Qu.:-0.5989   1: 35  
##  Median :-0.1449          
##  Mean   : 0.0000          
##  3rd Qu.: 0.2683          
##  Max.   : 2.9865
apply(boston.std, 2, var)
##       crim         zn      indus        nox         rm        age        dis 
## 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 
##        rad        tax    ptratio      black      lstat       medv       chas 
## 1.00000001 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 0.06451297

All variables except the factorized chas are mean centered, and have unit variance. Plot the distributions for clarity

boston.m <- 
  boston.std %>%
    melt(id.vars = c("chas"))

ggplot(boston.m, aes(value, col = chas)) +
  geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")

Create categorical variable of crime rate by quantiles

bins <- quantile(boston.std$crim)

crime <- cut(boston.std$crim, breaks = bins, 
             include.lowest = TRUE, 
             labels = c("low", "med_low", "med_high", "high"))

boston.std <- dplyr::select(boston.std, -crim)

boston.std$crime <- crime

Divide into train and test set (follow course’s Datacamp exercise)

# standardize the chas as we need later
boston.std$chas <- scale(as.numeric(boston.std$chas))

n <- nrow(boston.std)

set.seed(123)  # for reproducibility
# sample n * 0.8 entries from 1:n
ind <- sample(n,  size = n * 0.8)

train <- boston.std[ind,]
test <- boston.std[-ind,]

# save the correct classes from test data
test.correct.classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

LDA

Fit LDA on all variables as instructed. Note that none of the variable distribution is normal by Shapiro-Wilk normality test

# sapply(boston.std %>% select(-chas) %>% select(-crime), shapiro.test)

# fit LDA
lda.fit <- lda(crime ~ ., data = train)

# plot biplot (follow course's Datacamp exercise)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", 
                       tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.25)

From the biplot we see high crime rates are separated from the others well. rad, zn, (indus and nox) stand out from other variables, with rad being approximately orthogonal to the other ones. Medium high crime rates are separated pretty well from the low ones. Medium low rates overlap with low and medium high rates. There are a few outliers in all classes that are far from the center of mass of their respective group.

# print lda.fit details
# The prior probabilities seem to reflect the distribution in train data, 
# but are not identical
table(train$crime) / nrow(train)
## 
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        nox          rm        age        dis
## low       1.01866313 -0.9066422 -0.8835724  0.38666334 -0.9213895  0.9094324
## med_low  -0.07141687 -0.3429155 -0.5768545 -0.09882533 -0.3254849  0.3694282
## med_high -0.40148591  0.2162741  0.3637157  0.12480815  0.4564260 -0.3720478
## high     -0.48724019  1.0149946  1.0481437 -0.47733231  0.8274496 -0.8601246
##                 rad        tax    ptratio      black        lstat        medv
## low      -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775  0.47009410
## med_low  -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798  0.01548761
## med_high -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124  0.19440747
## high      1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309 -0.71294711
##                 chas
## low      -0.08120770
## med_low   0.03952046
## med_high  0.19544522
## high     -0.03371693
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## chas     0.002460776  0.03963849  0.1699312
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113

Prediction with LDA

# predict the classes
test.predictions <- predict(lda.fit, test)

# tabulate the results
table(test.predictions$class, test.correct.classes)
##           test.correct.classes
##            low med_low med_high high
##   low       10       2        1    0
##   med_low   10      17        9    0
##   med_high   4       6       13    1
##   high       0       0        2   27

High class is predicted well, as was expected. The other classes are predicted less accurately, med_high is often predicted as med_low, and low as med_low. Thus with this model we can only reliably separate high from other cases.

K-means

Reload boston and standardize.

boston <- MASS::Boston

boston <- scale(boston)

# Get pairwise distances 
boston.d <- dist(boston, method = "euclidean")
# Test different number of clusters
ks <- 1:15

# Determine optimal number of cluster by total within cluster 
# sum of squares

# We run k-means on the scaled data and not on the distances, 
# since k-means is not applicable to be run on distance matrix alone, see e.g.
# https://stackoverflow.com/questions/43512808/ and
# https://stats.stackexchange.com/questions/32925/
twcss <- sapply(ks, function(x) { kmeans(boston, centers = x)$tot.withinss})

# Plot
qplot(x = ks, y = twcss, geom = "line") + theme_bw()

Well, it is difficult to see what is the optimal number of clusters. Three or four is probably OK, two likely not large enough although that is where the largest drop in twcss occurs. Let’s run a (classical) multidimensional scaling in 2D on the distance matrix and see if there are clear clusters.

mds.fit <- cmdscale(boston.d, k = 2)

plot(mds.fit[, 1], mds.fit[, 2], 
     xlab = "Dimension 1", ylab = "Dimension 2")

Well, it seems like there are two clusters obtained with MDS.

Let’s run a k-medoid on the distance matrix and use the silhouette plot this time.

library(cluster)
library(factoextra)
# adapted from https://www.r-bloggers.com/2019/01/10-tips-for-choosing-the-optimal-number-of-clusters/

fviz_nbclust(as.matrix(boston.d), pam, method = "silhouette", k.max = 12) +
  theme_minimal() + ggtitle("Silhoutte plot for k-medoids")

With k-medoids, we get two clusters as the optimal.

Finally, plot dendrogram

# Choose average clustering as a "generic" one
hier.clust <-  hclust(boston.d, method = "average")
plot(hier.clust, cex = 0.5)
abline(h = 5.75, col = "red")

Hierarchical clustering suggests three big clusters and one or two smaller ones.

Since in k-means we noted that 3-4 clusters could be good, decrease the number to 3 based on the short analysis with MDS, k-medoids and hclust.

Let’s visualize the results

km <- kmeans(boston, centers = 3)

cols <- rep("black", nrow(boston))
cols[km$cluster == 2] <- "blue"
cols[km$cluster == 3] <- "red"

# make more compact, hide axes
pairs(boston, col = cols,
      gap = 0, xaxt = "n", yaxt = "n", pch = 3)

Pairs plot does not appear to support the hypothesis of three clusters explainable by any pair of variables, although there are clearly many plots with two groups e.g. with rad variable.

# Visualize k-means result in 2D with principal components 
# (from PCA, shows also the percentage of variance explained).

factoextra::fviz_cluster(km, data = boston,
             #palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw())

km2 <- kmeans(boston, centers = 2)

factoextra::fviz_cluster(km2, data = boston,
             #palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw())

Two clusters overlap less than three, but clustering into three groups is not too bad.

LDA on K-means clusters

Run LDA on k-means clusters (without train-test split).

boston <- as.data.frame(scale(MASS::Boston))

km <- kmeans(boston, centers = 3)

boston$class <- km$cluster
# fit LDA
lda.fit.km <- lda(class ~ ., data = boston)

# plot the lda results
plot(lda.fit.km, dimen = 2, col = boston$class, pch = boston$class) 
lda.arrows(lda.fit.km, myscale = 2.5)

Separation of the groups is similar to what we had before: one clearly different, and two overlapping. rad is again important variable to separate a group from the others. This time also tax is similar in direction and magnitude as rad, which we expected from correlation analysis. age and dis separate clusters 2 and 3 stronger than the other variables. They are more or less orthogonal to rad and tax.

# tabulate how k-means clusters are reproduced
table(predict(lda.fit.km)$class, km$cluster)
##    
##       1   2   3
##   1 133   0   0
##   2   0 152   8
##   3   9  10 194
# Seems to be reproduced pretty well.

3D plots

Reuse train set from above, follow instructions of the course

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)

plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type = "scatter3d", mode = "markers", 
        color = train$crime)

Plot the same with three k-means clusters

km <- kmeans(train %>% select(-crime), centers = 3)

plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type = "scatter3d", mode = "markers", 
        color = as.factor(km$cluster))

Now it is more apparent how there might be three clusters. In both plots there is one high crime rate cluster, and the rest of data points form two clouds: a low-to-med_low cluster and a med_high_to_med_low cluster.

All in all, for prediction tasks, reliable models can be built for detecting high and non-high binary classes. Splitting non-high into two groups will decrease prediction accuracy.


(more chapters to be added similarly as we proceed with the course!)